Plural formant speech synthesizer



Jan. 20, 1970 L. R. FOCHT ET AL 3,491,205

PLURAL FORMANT SPEECH SYNTHESIZER

Louis R. Focht, Huntington Valley, and Charles F. Teacher, Philadelphia, Pa., assignors to Philco-Ford Corporation, a corporation of Delaware

Filed Sept. 29, 1966, Ser. No. 582,898 (3 sheets of drawings)

Int. Cl. H04m 7/19
U.S. Cl. 179-1
4 Claims

ABSTRACT OF THE DISCLOSURE

A speech synthesizer which generates three-formant speech from a first control signal representative of the period of the first major oscillation of a speech wave following each pitch pulse of the speech wave, a second control signal representative of the maximum amplitude of each such oscillation, a pitch signal, and a voicing signal. The synthesizer includes first and second groups of signal shaping networks, each group comprising three signal shaping networks. The first control signal is supplied to the first group to produce three signals each of which has an amplitude representative of the frequency of a different formant of a speech wave, and to the second group to produce three signals each of which has an amplitude proportional to the amplitude of a different formant of the speech wave. The synthesizer also includes three modulators the input of each of which is supplied with the output signal of a different one of the shaping networks forming the second group and with the second control signal. Each modulator is responsive to the signals supplied thereto to produce a signal having an amplitude representative of the amplitude of a different formant of the speech wave. The synthesizer also includes formant synthesizers each supplied with the output of a different one of said modulators, the output signal of a different one of the shaping networks of the first group, and the pitch and voicing signals. Each synthesizer produces, in response to the four input signals, a different one of the three formant signals, and those three formant signals are added to produce a three-formant speech wave.

Speech waves are highly redundant, and a considerable saving in bandwidth can be realized by proper processing of a speech representative signal to eliminate components not required for intelligible speech communication. A bandwidth of approximately 3,000 cycles per second is required to transmit directly an intelligible voice communication. This bandwidth can be reduced by a factor of or more by proper signal processing.

A common type of speech bandwidth compression system is the formant tracking vocoder. The formant vocoder type of speech bandwidth compression system is based on the transmission of signals representative of the formants or vocal tract resonances of the speech wave. The conventional formant tracking vocoder system requires the transmission of signals representative of the frequency and amplitude of the three principal formants in the speech wave as well as signals representative of voicing and pitch information. Thus, such a system requires the transmission of eight independent parameters which convey the intelligibility of speech. Recently it has been discovered that the three formant amplitude parameters and the three formant frequency parameters of the prior art formant vocoder can be replaced by two new parameters. The two new parameters are the single equivalent formant frequency and its amplitude. These two new parameters contain most of the phonetic information of the original six parameters and of the original speech wave. According to the single equivalent formant concept, a sound can be represented at any instant by a single frequency signal which may or may not correspond to one of the formant frequencies of the sound. By using this concept, a speech communication system can be built that is less complicated than prior art systems and also capable of transmitting a speech signal at a smaller bandwidth than prior art speech communication systems. The concept of the single equivalent formant and a speech communication system that utilizes the concept are described in detail in co-pending U.S. patent application Ser. Nos. 582,605, filed Sept. 28, 1966, and 582,573, also filed Sept. 28, 1966, by L. Focht.
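The parameter count in the preceding paragraph can be illustrated with a small sketch. The class and field names here are hypothetical and chosen only for illustration; the patent specifies the parameters but not any data layout.

```python
from dataclasses import dataclass, fields


@dataclass
class FormantVocoderFrame:
    """Conventional formant vocoder: eight independent parameters."""
    f1_hz: float      # frequency of the first formant
    f2_hz: float      # frequency of the second formant
    f3_hz: float      # frequency of the third formant
    a1: float         # amplitude of the first formant
    a2: float         # amplitude of the second formant
    a3: float         # amplitude of the third formant
    pitch_hz: float   # pitch information
    voiced: bool      # voicing information


@dataclass
class SingleEquivalentFormantFrame:
    """Single equivalent formant (SEF) system: two parameters replace six."""
    sef_hz: float         # single equivalent formant frequency
    sef_amplitude: float  # amplitude of the single equivalent formant
    pitch_hz: float       # pitch information
    voiced: bool          # voicing information


def parameter_count(frame_type):
    """Number of independent parameters carried by one frame type."""
    return len(fields(frame_type))
```

Under this layout the single equivalent formant frame carries four parameters in place of eight.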

The synthesis at the receiving location of a single equivalent formant speech wave described in aforementioned U.S. patent application Ser. No. 582,573 is simple in implementation; however, the speech reconstructed in this manner has a nasal quality that may be distracting to the uninitiated listener. Therefore, it is sometimes desirable to synthesize from the single equivalent formant speech information the plural formant speech wave that the listener is accustomed to hearing.

It is, accordingly, an object of the present invention to provide a novel speech communication system.

It is another object of the present invention to provide a speech communication system in which the synthesized speech wave is the type that the listener is accustomed to hearing.

According to the present invention, the single equivalent formant signal extracted at the analyzer of a single equivalent formant communication system is transmitted to the synthesizer of the communication system at the receiving location and there converted to plural formant speech. That is, plural formant speech is reconstructed at the synthesizer from the single equivalent formant representative signals transmitted from the analyzer.

The above objects and other objects inherent in the present invention will become more apparent when read in conjunction with the following specification and drawings in which:

FIG. 1 is a graph showing the frequencies of the first three formants and the corresponding frequency of the single equivalent formant for each of ten vowel sounds;

FIG. 2 is a graph showing the relative amplitudes of the first three formants and the frequency of the single equivalent formant for each of the ten vowel sounds of FIG. 1;

FIG. 3 is a block diagram of a communication system according to the present invention;

FIG. 4 is a schematic circuit diagram of a signal shaper portion of the system of FIG. 3;

FIG. 4a is a plot of the signal input-output characteristics for the signal shaper circuits employed in the system of FIG. 3;

FIG. 5 is a block diagram of a component of the system of FIG. 3; and

FIGS. 6 to 9 are block diagrams illustrative of components of the block diagram of FIG. 3.

In order to understand the concept of the generation of three formant speech from single equivalent formant speech, it is necessary to know the relationship between the frequency of the single equivalent formant and the frequencies and amplitudes of the first three formants of a particular sound. Referring to FIGURE 1, the frequency of the single equivalent formant and the frequencies of the first three formants for ten vowel sounds are graphically shown. The vowel sounds are grouped as back, central, and front. The back, central, and front vowels are articulated in the back, central, and front portions of the vocal tract, respectively.

FIGURE 1 shows that for each value of the single equivalent formant frequency, there is a corresponding value for each of the first three formants. The frequency of the single equivalent formant is lowest for the vowel sound U (boot) and progressively higher in the order of the vowel sounds shown in FIG. 1. The frequency of the first formant for the ten vowel sounds also is low for the vowel sound U (boot) and increases for the back vowels until a maximum value is reached in the region of the central vowels and then decreases for the front vowels. The frequencies of the second and third formants again are lowest for the vowel sound U (boot) and progressively higher in the order of the vowels shown in FIG. 1. The rate of increase is less for the central and front vowels (i.e. higher single equivalent formant frequencies) than it is for the back vowels. It can be shown that a similar relationship between the frequency of the single equivalent formant and the frequencies of the first three formants holds for other speech sounds. Since in the transmission of speech by means of single equivalent formant parameters the frequency of the single equivalent formant is represented by the amplitude of an electrical signal, it is possible to develop from this latter signal signals having amplitudes proportional to the frequencies of the first three formants of the speech wave. All that is required are three amplifiers each having a gain at any input signal amplitude level which is proportional to the ratio of the frequency of the single equivalent formant to the frequency of a selected one of the first three formants at the frequency of the single equivalent formant represented by that amplitude level. The resultant signals are signals having amplitudes at any instant which are representative of the frequencies of the first three formants of the sound being transmitted at that instant. These amplitude varying signals may be employed to control the frequency of three oscillators which regenerate signals having instantaneous frequencies equal to the frequencies of the first three formants of the sound to be represented at that instant.
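The mapping from single equivalent formant frequency to the three formant frequencies can be sketched digitally as a set of piecewise-linear curves, a rough software stand-in for the level-dependent amplifier gains described above. The breakpoint values below are hypothetical placeholders, not values read from FIG. 1; they only imitate the described shapes (first formant rising and then falling, second and third formants rising with a decreasing slope).

```python
from bisect import bisect_right


def piecewise_linear(x, points):
    """Interpolate linearly between sorted (x, y) breakpoints, clamping at the ends."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)  # xs[i - 1] <= x < xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical curves: (single equivalent formant frequency, formant frequency), in Hz.
F1_CURVE = [(500, 300), (1200, 700), (2500, 300)]     # rises, then falls
F2_CURVE = [(500, 900), (1200, 1500), (2500, 2200)]   # rises, slope decreasing
F3_CURVE = [(500, 2200), (1200, 2600), (2500, 3000)]  # rises, slope decreasing


def formant_frequencies(sef_hz):
    """Return assumed (F1, F2, F3) in Hz for a given SEF frequency."""
    return tuple(piecewise_linear(sef_hz, c) for c in (F1_CURVE, F2_CURVE, F3_CURVE))
```

For example, `formant_frequencies(850)` interpolates halfway between the first two breakpoints of each curve. Real curves would have to be fitted to measured vowel data.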

Although, as previously described, signals at the frequencies of the first three formants of a sound can be produced when the single equivalent formant frequency is known, additional information must be conveyed by the single equivalent formant frequency signal if three formant speech is to be synthesized. The additional information that must be conveyed is the amplitude of each of the first three formants of a sound. Since the relative amplitude of the first three formants of a sound is the principal factor determining the single equivalent formant frequency of the sound (attention is directed to aforementioned U.S. patent application Ser. No. 582,605), knowledge of the single equivalent formant frequency of a sound conveys sufficient information for determining the amplitudes of the first three formants of the sound relative to the amplitude of the single equivalent formant.

The manner in which the relative amplitudes of the first three formants of a sound can be determined by knowledge of the single equivalent formant frequency of the sound will be explained in conjunction with FIGURE 2. FIGURE 2 graphically shows the relative formant amplitude in db after a 9 db per octave high frequency emphasis for the ten vowel sounds shown in FIGURE 1. The single equivalent formant frequency for the ten vowels is also superimposed on the graph of FIGURE 2.

FIGURE 2 shows that for each value of the single equivalent formant frequency there is a corresponding value for the ratio of the amplitude of each of the first three formants to the amplitude of the single equivalent formant. The graph of the amplitude of each of the three formants for the ten vowels has a slight foldover characteristic for increasing values of the single equivalent formant frequency. By foldover characteristic it is meant that the magnitude of a formant increases with an increasing single equivalent formant frequency until a maximum value of the formant amplitude is reached; beyond the maximum value the formant amplitude decreases in magnitude even though the frequency of the single equivalent formant continues to increase. Thus, by using amplifiers in the synthesizer having gains controlled by the amplitude of the signal representative of the frequency of the single equivalent formant, the amplitudes of each of the three formants of three formant speech can be derived from the signal representative of the amplitude of the single equivalent formant for each frequency value of the single equivalent formant.
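As a minimal sketch of this amplitude derivation, the relative amplitude of a formant (in db) can be looked up from the single equivalent formant frequency and then applied as a gain to the single equivalent formant amplitude. The curve values are hypothetical, chosen only to show a foldover shape, and a linear amplitude scale is assumed for simplicity.

```python
from bisect import bisect_right


def interpolate(x, points):
    """Piecewise-linear lookup over sorted (x, y) breakpoints, clamped at the ends."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)
    return ys[i - 1] + (ys[i] - ys[i - 1]) * (x - xs[i - 1]) / (xs[i] - xs[i - 1])


# Hypothetical foldover curve: (SEF frequency in Hz, relative formant amplitude in db).
# The amplitude rises to a maximum and then falls as the SEF frequency keeps increasing.
FIRST_FORMANT_REL_DB = [(500.0, -6.0), (1200.0, 0.0), (2500.0, -12.0)]


def formant_amplitude(sef_hz, sef_amplitude, rel_db_curve):
    """Absolute formant amplitude: SEF amplitude scaled by the frequency-dependent gain."""
    rel_db = interpolate(sef_hz, rel_db_curve)
    return sef_amplitude * 10.0 ** (rel_db / 20.0)
```

The same function would be called three times, once per formant, each with its own curve.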

The block diagram of FIG. 3 shows the analyzer and synthesizer of the single equivalent formant communication system of the present invention. An electrical representation of a speech wave, such as produced by a standard telephone carbon microphone (not shown), is supplied to a single equivalent formant frequency detector 2, a single equivalent formant amplitude detector 4, and a pitch detector 6. The output of pitch detector 6 is supplied to the detectors 2 and 4 and to a voicing detector 8.

FIG. 6 is a block diagram of a preferred form of the single equivalent formant frequency detector 2 of FIG. 3. It comprises a circuit for measuring the period of the first major oscillation of the complex speech wave after each pitch pulse thereof, and hence, the inverse of the frequency of the single equivalent formant. The electrical signal representative of the input speech wave is supplied through an amplifier 60 and a high frequency pre-emphasis network 62 to the input of a high gain threshold circuit 64, such as a Schmitt trigger. Network 62, which includes a capacitor 66 and a resistor 68, acts as a differentiator, emphasizing the high frequency components of the input speech wave. High gain threshold circuit 64 is set to produce an output only in response to one polarity of the differentiated input speech wave. The output signal of circuit 64 is supplied to one input terminal of a bistable switching circuit 70. The output of pitch detector 6, whose construction is explained hereinafter, is supplied to a second input terminal of circuit 70. Bistable switching circuit 70 is coupled by means of a pulse width-to-amplitude converter 72, which may take the form of a ramp generator, to the input of a sample and hold circuit 74. The output of the sample and hold circuit 74 is a signal of slowly varying amplitude, the instantaneous amplitude of which is inversely proportional to the frequency of the single equivalent formant.
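A loose digital analogy of this detector chain is sketched here. It assumes that the bistable circuit is set by each pitch pulse and reset when the threshold circuit first fires on the differentiated wave, which is one reading of the description rather than a confirmed circuit detail. The function names, the positive polarity, and the `ramp_slope` parameter are assumptions for illustration.

```python
def pulse_widths(samples, pitch_indices, threshold):
    """Samples elapsed from each pitch pulse to the first upward threshold
    crossing of the differentiated wave (None if no crossing follows)."""
    # A first difference stands in for the differentiating network 62.
    diff = [0.0] + [b - a for a, b in zip(samples, samples[1:])]
    widths = []
    for p in pitch_indices:
        width = None
        for n in range(p + 1, len(diff)):
            # The threshold circuit responds to one polarity only (positive here).
            if diff[n] >= threshold and diff[n - 1] < threshold:
                width = n - p
                break
        widths.append(width)
    return widths


def held_control_values(widths, sample_rate, ramp_slope):
    """Ramp generator plus sample and hold: amplitude proportional to pulse width.
    A missing width (None) leaves the previously held value unchanged."""
    held = []
    last = 0.0
    for w in widths:
        if w is not None:
            last = ramp_slope * w / sample_rate
        held.append(last)
    return held
```

Because the held value grows with the measured width, it varies inversely with the frequency of the oscillation being timed, matching the behavior attributed to the output of circuit 74.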

FIG. 7 is a block diagram of a preferred form of the single equivalent formant amplitude detector 4 of FIG. 3. The input speech waveform is supplied to a peak detector 76 by means of a logarithmic amplifier 78. A sample and hold circuit 80 is coupled to peak detector 76 and to a low pass filter 82. Pitch pulses from the pitch detector 6 gate the sample and hold circuit 80 to effect measurement of the logarithm of the peak amplitude of the complex speech wave. Filter 82 removes the high frequency components from the output signal of circuit 80, thereby providing a slowly varying signal proportional to the logarithm of the amplitude of the single equivalent formant.
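The amplitude detector can be approximated in software as follows. This is a sketch under simplifying assumptions: the peak is taken over each interval between consecutive pitch pulses, pitch indices are strictly increasing, the logarithm is base 10, and the low pass filter is a one-pole smoother with a hypothetical coefficient `alpha`.

```python
import math


def log_peak_amplitudes(samples, pitch_indices, floor=1e-6):
    """Logarithm of the peak magnitude within each pitch period."""
    bounds = list(pitch_indices) + [len(samples)]
    values = []
    for start, end in zip(bounds, bounds[1:]):
        peak = max(abs(s) for s in samples[start:end])
        # The floor avoids log(0) on silent periods.
        values.append(math.log10(max(peak, floor)))
    return values


def low_pass(values, alpha):
    """One-pole smoother standing in for low pass filter 82 (0 < alpha <= 1)."""
    if not values:
        return []
    y = values[0]
    out = []
    for v in values:
        y += alpha * (v - y)
        out.append(y)
    return out
```

Chaining the two, `low_pass(log_peak_amplitudes(samples, pitch_indices), alpha)` gives one slowly varying value per pitch period.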

FIG. 8 is a block diagram of a preferred form of the pitch detector 6 of FIG. 3. The input speech wave is supplied via a high frequency pre-emphasis network 84 to a non-linear or logarithmic amplifier 86. The output of amplifier 86 is coupled to a peak detector 88 which has a long time constant and to a peak detector 90 which has a short time constant. Peak detector 90 is coupled by a voltage threshold conduction device 92, such as a Zener diode, and an emitter follower network 94 to the output of peak detector 88, which is coupled to a differentiating and amplifying network 96. Since the potential difference between the output signals of detectors 88 and 90 is small immediately after the occurrence of a pitch pulse, voltage threshold conduction device 92 does not conduct immediately after such occurrence. Hence those harmonic peaks in the input speech wave which occur immediately after a pitch pulse are not detected. When the potential difference between the output signals of detectors 88 and 90 is sufficient to initiate conduction of device 92, the peak detector follows the discharge characteristics of the short time constant detector 90. Hence the peak detector detects pitch pulses even when there is a rapid decrease in the amplitude of the input speech wave. Accordingly, the output signal of network 96 comprises pulses the repetition rate of which is the same as the pitch rate of the input speech wave.
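The interaction of the two peak detectors can be sketched with a per-sample model. This is an illustrative simplification: the input is assumed to be already pre-emphasized, compressed and non-negative, the discharges are modeled as per-sample multiplicative decays, and the conduction threshold of device 92 is modeled as a fixed offset `zener`. All parameter names and values are assumptions.

```python
def detect_pitch_pulses(samples, slow_decay, fast_decay, zener):
    """Return sample indices at which the slow peak detector is recharged."""
    slow = 0.0  # output of the long time constant detector (88)
    fast = 0.0  # output of the short time constant detector (90)
    pulses = []
    for n, x in enumerate(samples):
        # The short time constant detector discharges quickly and recharges on any peak.
        fast = max(fast * fast_decay, x)
        # The slow detector discharges slowly, but once it exceeds the fast
        # detector by more than the threshold it is pulled down with it.
        slow = min(slow * slow_decay, fast + zener)
        if x > slow:
            # Recharging the slow detector marks a pitch pulse.
            slow = x
            pulses.append(n)
    return pulses
```

With `samples = [10, 0, 6, 0, 0, 0, 0, 0, 5, 0]`, `slow_decay=0.95`, `fast_decay=0.8` and `zener=3.0`, the harmonic peak of 6 shortly after the first pulse is ignored, while the much weaker pulse of 5 is still detected because the slow detector has by then been pulled down along the fast discharge.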

FIG. 9 is a block diagram of a preferred form of the voicing detector 8 of FIG. 3. Pitch pulses from the pitch detector 6 are supplied via a pulse width-to-amplitude converter 98, such as a ramp generator, to the input of a first sample and hold circuit 100. A differentiator network 102 couples sample and hold circuit 100 to a second sample and hold circuit 104. Since the output signal of differentiator network 102 has amplitude peaks only when the repetition rate of the pitch pulses is irregular, the value of the output signal of circuit 102 is zero when the repetition rate of the pitch pulses is regular (voiced sounds) and other than zero when the repetition rate of the pitch pulses is irregular (unvoiced sounds).
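The voicing decision reduces to checking whether successive pitch periods are equal. A minimal sketch follows, with a hypothetical `tolerance` parameter added because measured periods are rarely exactly equal; the patent text itself speaks only of a zero or non-zero differentiator output.

```python
def voicing_flags(pitch_times, tolerance):
    """True where consecutive pitch periods match (voiced), False otherwise."""
    # The interval between pitch pulses plays the role of the ramp amplitude held by circuit 100.
    periods = [b - a for a, b in zip(pitch_times, pitch_times[1:])]
    # The change between successive held values plays the role of differentiator 102.
    changes = [b - a for a, b in zip(periods, periods[1:])]
    return [abs(c) <= tolerance for c in changes]
```

For pulse times `[0, 10, 20, 30, 37, 52]` and a tolerance of 1, the periods are 10, 10, 10, 7, 15, so the first two comparisons indicate voiced speech and the last two indicate unvoiced speech.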

The construction and operation of detectors 2, 4, 6 and 8 are described in more detail in the aforementioned copending U.S. patent application Ser. No. 582,605.

The signals generated by the detectors 2, 4, 6 and 8 are transmitted in any convenient manner, for example by conventional wire facilities or electromagnetic systems, to a synthesizer network. For example, the detector signals can be transmitted by continuously varying the amplitude of an RF carrier signal in accordance with the amplitude of the detector signals. If the signals from the detectors are transmitted directly rather than as a modulation of a carrier wave, an amplitude voltage reference level could be established at the receiver and the amplitude of the transmitted signal compared therewith.

The signal from the single equivalent formant frequency detector 2 is supplied through an amplifier circuit 10 and a shaper circuit 12 to the input of a first formant synthesizer network 36, through a shaper circuit 14 to a second formant synthesizer network 38, and through a shaper circuit 16 to a third formant synthesizer network 40. Synthesizer networks 36, 38 and 40 have their output terminals coupled together.

FIG. 4 is a typical schematic circuit diagram of section 11 of the block diagram of FIG. 3. The input-output response characteristics of amplifiers 12 and 14 are shown in FIG. 4a. A circuit similar to the circuit diagram of shaper 14 could be used as the circuit for shaper 16. The values of the load resistors of the circuit of shaper 16 would be chosen to produce the desired non-linear input-output signal relationship required of shaper 16.

The signal from the detector 2 is also supplied to amplifier networks 18, 20, and 22. Amplifier networks 18, 20, and 22 are coupled by means of shaper networks 24, 26, and 28, respectively, to modulators 30, 32, and 34, respectively. The circuitry of amplifier networks 18, 20, and 22 may be similar to the circuitry of amplifier network 10, and the circuitry of shaper networks 24, 26, and 28 may be similar to the circuitry of shaper network 12.

The signal from the single equivalent formant amplitude detector 4 is also supplied to amplitude modulators 30, 32, and 34. Modulators 30, 32, and 34 are coupled to synthesizer networks 36, 38, and 40, respectively.

Each of synthesizer networks 36, 38 and 40 can have the structure shown in block diagram in FIG. 5. For synthesizer network 36, shaper 12 is connected to the input of an oscillator 42 the output signal of which is supplied via an amplitude modulator 43 to a de-emphasis network 45. The output signal of modulator 30 is supplied to one input of an amplitude modulator 44 the output signal of which is supplied to modulator 43 via a peak detector 47. The pitch and voicing signals from the detectors 6 and 8 are supplied respectively to oscillator 46 and linear modulator 48. Modulator 48 also receives a signal from noise generator 49. The output signal of linear modulator 48 is supplied to frequency-controllable pitch oscillator 46, and the output signal of oscillator 46 is supplied to frequency-controllable formant oscillator 42 and to amplitude modulator 44. Synthesizer networks 38 and 40 are similarly connected, with shapers 14 and 16 respectively substituted for shaper 12 and modulators 32 and 34 respectively substituted for modulator 30.

The circuit of FIG. 3 functions in the following manner. Amplifier circuit 10 and the shaper circuits 12, 14, and 16 modify the input signal, which has an amplitude representative of the frequency of the single equivalent formant, to produce three control waveforms at the inputs to the networks 36, 38, and 40, respectively, having instantaneous amplitudes corresponding to the frequencies of the first, second, and third formants (FIG. 1). Amplifier circuits 18, 20, and 22 function in conjunction with shaper circuits 24, 26, and 28 to modify the input signal from detector 2 to produce waveforms at the inputs to the modulators 30, 32, and 34, respectively, proportional to the amplitudes of the first, second, and third formants (FIG. 2). The waveforms proportional to the relative values of the amplitudes of the first, second, and third formants modulate the signal from the single equivalent formant amplitude detector 4 to produce control signals at the inputs to the networks 36, 38, and 40, respectively, representative of the absolute amplitudes of the first, second, and third formants.

The inputs to synthesizer networks 36, 38, and 40 contain all the phonetic information needed to produce the first, second, and third formants of human speech, respectively. Referring specifically to network 36, oscillator 46 (FIG. 5) produces a pitch signal which is supplied to amplitude modulator 44 and to frequency-controllable formant oscillator 42. Modulator 44 amplitude modulates the pitch signal in response to the output signal of modulator 30. The pitch signal supplied to oscillator 42 controls the repetition rate of the frequency-modulated signal produced by oscillator 42. The frequency of the latter signal between successive pitch signals is determined by the amplitude of the output signal of shaper 12. The amplitude-modulated pitch signal produced by modulator 44 undergoes peak detection by detector 47, and the signal produced by detector 47 in response thereto is used by modulator 43 to amplitude-modulate the frequency-modulated output signal supplied thereto by oscillator 42. The resultant amplitude-modulated, frequency-varying signal is smoothed by de-emphasis network 45 to produce a signal representative of the first formant of the speech wave being synthesized.

In a similar manner networks 38 and 40 produce signals respectively representative of the second and third formants of the speech wave being synthesized. The signals representative of the first, second and third formants are summed to obtain a synthesized three-formant speech wave. The operation of the network of FIG. 5 is described in detail in the aforementioned copending U.S. patent application Ser. No. 582,573.
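A much simplified digital sketch of this synthesis, for voiced sounds only, models each formant as a sinusoid that restarts at every pitch pulse and decays until the next one, and then sums the three formants. The exponential envelope with its `decay` parameter is an assumed stand-in for the peak-detected envelope, and the noise path for unvoiced sounds and the de-emphasis network are omitted.

```python
import math


def synthesize_formant(formant_hz, amplitude, pitch_hz, sample_rate, n_samples, decay):
    """One formant: a sinusoid restarted at each pitch pulse under a decaying envelope."""
    period = max(1, round(sample_rate / pitch_hz))  # samples per pitch period
    out = []
    for n in range(n_samples):
        k = n % period  # samples elapsed since the last pitch pulse
        envelope = amplitude * decay ** k
        out.append(envelope * math.sin(2.0 * math.pi * formant_hz * k / sample_rate))
    return out


def synthesize_speech(formants, pitch_hz, sample_rate, n_samples, decay=0.97):
    """Sum the formant signals; `formants` is a list of (frequency_hz, amplitude)."""
    waves = [
        synthesize_formant(f, a, pitch_hz, sample_rate, n_samples, decay)
        for f, a in formants
    ]
    return [sum(column) for column in zip(*waves)]
```

For example, `synthesize_speech([(500.0, 1.0), (1500.0, 0.5), (2500.0, 0.25)], 100.0, 8000, 160)` produces two pitch periods of a steady three-formant wave. In a full system the frequencies and amplitudes would vary from period to period with the received control signals.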

Although the system for generating three formant speech from single equivalent formant speech has been described as using particular amplifying and shaping circuits, it is obvious that other circuits that will produce the same values of amplitude and frequency control signal for a particular single equivalent formant frequency can be used.

The system of the present invention provides a major advantage over prior art communication systems because it permits plural formant speech to be synthesized from a transmitted single equivalent formant signal. This allows the listener to hear the plural formant speech that he is accustomed to hearing while taking advantage of the decreased data rate and bandwidth of the transmitter characteristic of single equivalent formant speech transmission.

While the invention has been described with reference to certain preferred embodiments thereof, it will be apparent that various modifications and other embodiments thereof will occur to those skilled in the art within the scope of the invention. Accordingly we desire the scope of our invention to be limited only by the appended claims.

We claim:

1. A speech synthesizer for synthesizing a multiformant speech wave in response to a first input signal representative at any given time of the period of the first major oscillation of a speech wave occurring after that pitch pulse of said wave which immediately precedes said given time, a second input signal representative at said given time of the maximum amplitude of said first major oscillation, a pitch signal and a voicing signal, said synthesizer comprising a first group of signal shaping networks supplied with and responsive to said first input signal to produce a first plurality of signals each of which is representative of the frequency of a different formant of said speech wave; a second group of signal shaping networks supplied with and responsive to both said first and second input signals to produce a second plurality of signals each of which is representative of the amplitude of a different formant of said speech wave; a plurality of formant synthesizer networks supplied with and responsive to said pitch signal, and having their outputs coupled together; and means for supplying those signals of said first and second plurality of signals that are representative of the same formant of said speech wave to only one of said plurality of synthesizer networks.

2. The synthesizer according to claim 1 wherein each of said second group of signal shaping networks includes an amplifier, a signal shaper supplied with and responsive to the output of the amplifier, and a modulator supplied with and responsive to the output of the signal shaper; said synthesizer also comprising means for supplying said first input signal to each of said amplifiers, and means for supplying said second input signal to each of said modulators.

3. The synthesizer according to claim 2 wherein each of said formant synthesizer networks includes a first signal controlled oscillator supplied with and responsive to a signal representative of the frequency of one formant, an amplitude modulator supplied with and responsive to a signal representative of the amplitude of said one formant, a second signal controlled oscillator supplied with and responsive to said pitch signal, and means coupling the output of said second signal controlled oscillator to an input of said first signal controlled oscillator and an input of said amplitude modulator.

4. The synthesizer according to claim 3 wherein each of said formant synthesizer networks further includes a noise generator, a linear modulator supplied with and responsive to both said voicing signal and the output of said noise generator, means coupling the output of said linear modulator to an input of said second signal controlled oscillator, a peak detector supplied with the output of said first amplitude modulator, and a second amplitude modulator supplied with both the output of said first signal controlled oscillator and the output of said peak detector.


